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Abstract 

Background: Chromosonne ends are composed of telomeric repeats and subtelomeric regions, which are 
patchworl<s of genes interspersed with repeated elements. Although chromosome ends display similar arrangements 
in different species, their sequences are highly divergent. In addition, these regions display a particular nucleosomal 
composition and bind specific factors, therefore producing a special kind of heterochromatin. Using data from 
currently available draft genomes we have characterized these putative Telomeric Associated Sequences in 
Toxoplasma gondii. 

Results: An all-vs-all pairwise comparison of 7". gond/V assembled chromosomes revealed the presence of conserved 
regions of 30 Kb located near the ends of 9 of the 14 chromosomes of the genome of the 1\/1E49 strain. Sequence 
similarity among these regions is ~ 70%, and they are also highly conserved in the GTl and VEG strains. However, they 
are unique to Toxoplasma with no detectable similarity in other Apicomplexan parasites. The internal structure of 
these sequences consists of 3 repetitive regions separated by high-complexity sequences without annotated genes, 
except for a gene from the Toxoplasma Specific Family. ChlP-qPCR experiments showed that nucleosomes associated 
to these sequences are enriched in histone H4 monomethylated at K20 (H4K20mel ), and the histone variant H2A.X, 
suggesting that they are silenced sequences (heterochromatin). A detailed characterization of the base composition 
of these sequences, led us to identify a strong long-range compositional bias, which was similar to that observed in 
other genomic silenced fragments such as those containing centromeric sequences, and was negatively correlated to 
gene density. 

Conclusions: We identified and characterized a region present in most Toxoplasma assembled chromosomes. Based 
on their location, sequence features, and nucleosomal markers we propose that these might be part of subtelomeric 
regions of T gondii. The identified regions display a unique trinucleotide compositional bias, which is shared (despite 
the lack of any detectable sequence similarity) with other silenced sequences, such as those making up the 
chromosome centromeres. We also identified other genomic regions with this compositional bias (but no detectable 
sequence similarity) that might be functionally similar. 

Keywords: Toxoplasma gondii, Telomeric Associated Sites (TAS), Subtelomeric heterochromatin. Trinucleotide 
compositional bias 
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Background 

Toxoplasma gondii is a widespread obligate intracellu- 
lar protozoan parasite, member of the phylum Apicom- 
plexa. T, gondii has been recognized as an important 
pathogen for humans, particularly during pregnancy and 
for immunocompromised patients [1]. Toxoplasmosis has 
been also documented as an economically important dis- 
ease that has considerable impact on the livestock indus- 
try [1,2]. It was for these reasons that T, gondii was one 
of the first protozoan parasites chosen for a genome- 
sequencing project. And more recently, the genomes of 
other strains of T. gondii were sequenced [3]. 

T, gondii presents a haploid genome of ~63 Mb, which 
is organized in 14 chromosomes that are well conserved 
in length and number among different strains [4,5]. The 
genomic DNA sequence, and the way it is organized in 
the nucleus are fundamental for the correct regulation of 
cell processes. The level of chromatin condensation, its 
nucleosome composition and positioning, together with 
their binding to non-histone nuclear proteins generates 
the different states of chromatin [6]. Euchromatin is a 
gene-rich decondensed chromatin, where transcription is 
facilitated, whereas heterochromatin is a gene-poor con- 
densed chromatin, refractory to transcription. A general 
and very simplified rule is that euchromatin is active chro- 
matin where nucleosomes are enriched in histones H3 
and H4 acetylated (H3ac/H4ac), and H3 tri-methylated 
at lysine 4 (H3K4me3); whereas constitutive heterochro- 
matin nucleosomes are enriched in histone H3 di- and 
tri-methylated at lysine 9 (H3K9me2/3), which binds Het- 
erochromatin Protein 1 (HPl) forming a compact chro- 
matin. These epigenetic marks are conserved from yeast 
to humans, and T, gondii is not an exception [7-10]. Also, 
histones H2A and H2B, which are less conserved than H3 
and H4, have several variants that contribute differently 
to the chromatin state [11]. We have previously charac- 
terized histones H2A and H2B in T, gondii^ and found the 
presence of a histone H2B variant (H2Bv) which is only 
present in protozoans [12], recently renamed as H2B.Z 
[13], and two H2A variants: H2A.X and H2A.Z [14]. We 
have also observed that H2A.Z and H2B.Z are enriched 
in active promoters, whereas H2A.X is associated to 
silenced promoters and heterochromatin [14]. Thus, these 
modified histones and histone variants can be used in 
Toxoplasma as epigenetic markers of euchromatin and 
heterochromatin. 

In constitutive heterochromatin, two specialized domains 
can be readily identified, the centromere and the telom- 
eres. The centromere is a genetic locus, where the spindle 
fibers attach to the chromosomes forming the kineto- 
chore, and is required for proper chromosome segre- 
gation (reviewed in [15]). Centromeric DNA sequences 
are not conserved among species, and they differ even 
between chromosomes in the same organism. However, 



the centromeric chromatin in all organisms studied to 
date, contains the histone H3 variant CENP-A [15,16]. 
The telomeres, on the other hand, are located at the chro- 
mosome ends and contain the telomeric repeats and the 
subtelomeric region, sometimes denominated Telomeric 
Associated Sequence (or TAS). Basically, telomeres are 
composed of a tract of simple repeats (like TTAGGG in 
humans and trypanosomatids), followed by the subtelom- 
ere that comprises repetitive elements and, in some cases, 
subtelomeric genes [17,18]. In general, these subtelom- 
eric genes are associated with different stress responses, 
whose expression is regulated by the Telomeric Position 
Effect [17]. Telomeres and subtelomeres replicate late in 
the S -phase of cell cycle [19,20]. The repetitive elements 
seem to be involved in blocking replication initiation 
in the subtelomeric regions [20-22], and favoring their 
nuclear-periphery localization [20,23] . Although the fea- 
tures of chromosome ends are highly conserved, the DNA 
sequences are specific to each species [17]. TASs, as well 
as centromeres, display a specialized type of chromatin, 
with a particular nucleosome composition and a set of 
associated proteins responsible of its highly condensed 
state [18,24]. In Toxoplasma, only a few heterochromatin- 
associated proteins have been described to date, including 
the histone markers mentioned above [8,14,25], and the 
recently described TgChromol with a peri-centromeric 
localization [10]. Concerning constitutive heterochromatin, 
only centromeres were identified. Centromeric sequences 
were determined as genomic regions enriched in cenH3 
{Toxoplasma CENP-A homologous) [26]. Thus, T gondii 
heterochromatin characteristics, composition, regulation, 
and distribution across the genome are still to be discovered. 

In many protozoan pathogens, such as Trypanosoma 
brucei, Leishmania major, Plasmodium falciparum and 
some yeast, the subtelomeric genes are contingency genes 
[24]. These are sets of genes responsible for pathogen 
diversity and for the clonal phenotype switches often 
associated with the parasite's escape from the host 
immune system. In the Apicomplexan parasite R falci- 
parum, chromosome ends consist of the telomeric repeat 
(T(G/A)AAGGG)n followed by a telomere-associated 
repetitive elements (TARE 1-6) organized in six blocks 
flanked with the members of the van stevor and rif multi- 
gene families responsible for the antigenic variation and 
cytoadhesion [27-29]. TASs in this parasite also present 
a specialized chromatin, different from the rest of the 
genome, rich in nucleosomes containing H3K9me3 and 
the histone deacetylase PfSir2 [30,31]. 

Despite the importance and growing evidence of the 
effect of telomeres and subtelomeres on the expres- 
sion of surrounding genes and DNA replication timing 
modulation, in Toxoplasma telomeres have not yet been 
analyzed; probably because clustered contingency genes 
at chromosome ends have not been described in this 
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apicomplexan parasite, and/or because the chromosome 
ends are not completely assembled in the current available 
genomes. In this study we describe putative subtelom- 
eric sequences in T. gondii assembled chromosomes, their 
nucleosome and nucleotide composition. 

Results 

Several Toxoplasma chromosome ends display significant 
sequence similarity 

Subtelomeric sequences show extensive similarity across 
multiple chromosomes [32,33]. This has also been 
observed in other Apicomplexa parasites such as Plas- 
modium [28], where they have been called "Telomeric 
Associated Sequences" (TASs). Therefore, we decided to 
use an all-vs-all BLAST comparison of assembled chro- 
mosomes to identify putative subtelomeric sequences 
in Toxoplasma, We validated this bioinformatic strategy 
using the R falciparum 3D7 genome assembly (Plas- 
moDB v8.2, where TASs were already identified [28]). In 
Plasmodium we observed that similarities among chro- 
mosomes were restricted to telomeric and subtelomeric 
sequences (Additional file lA), and to regions in the center 
of the chromosomes 4, 6, 7, 8, and 12 that also contain sub- 
telomeric genes, and are enriched in H3K9me3 a histone 
mostly restricted to telomeres in P, falciparum [30]. We 
next performed the same all-vs-all BLAST comparison 
using Toxoplasma chromosomal assemblies of the ME49 
genome (TgME49), obtained from ToxoDB version 7.3 [3]. 
As a result, we identified a conserved region of approxi- 
mately 30 Kb at the end of 9 out of the 14 chromosomes 
with high identity among them (Figure lA, Additional 
file IB). Because these sequences present the same pattern 



of sequence similarity observed in P. falciparum sub- 
telomeric regions, we denominated them T gondii TAS- 
like sequences (TgTAS-like or TgTASL). These TAS- 
like regions were found in T, gondii chromosomes la, 
II, III, IV, V, IX, X, XI and XII; with chromosomes III 
and X containing two such regions each. The similarity 
among the different TgTASL and their localization can 
be observed in Figure lA. In the figure, chromosomes 
are laid out in a circle, and blocks of sequence similar- 
ity longer than 180 bp and with >80% sequence iden- 
tity are represented with ribbons; those corresponding to 
TgTASL sequences are highlighted in red. Consequently, 
black ribbons in Figure 1 correspond to other inter- 
chromosome repeated sequences, which most of them 
correspond to the mitochondrial-like repeats (data not 
shown) [34]. Interestingly, at the end of chromosome XI 
and the begining of Vllb there are short sequences similar 
to TgTASL (red links Figure lA, Additional file IB). These 
sequences were not included in our analysis because 
they were too short, but they could represent TgTAS- 
like regions that were not correctly assembled. Figure 1 A 
also shows an additional data track corresponding to the 
annotated genes in the ME49 genome. A closer inspec- 
tion of the TgTASL regions shows that they are depleted 
of genes, with the exception of only one gene located 
next to one end of the TgTASL region (blue arrows in 
Figure IB). 

The putative Toxoplasma Telomeric Associated Sequences 
have a conserved structure 

Subtelomeric sequences often display a pattern of DNA 
repeat blocks, linker regions and sometimes genes. To 




Figure 1 Telomeric-associated sequences are shared between Toxoplasma chromosome ends. A) Circular ideogram for genome 
visualization using circos [62]. Chromosomes (l-XII) were laid out in order and labeled with different colors. Centromeres (in green), and TgTAS-like 
regions (in red) are highlighted in the central white boxes. 50 Kb regions containing the TgTAS-like sequences were zoomed (xl 0) (see the ruler 
above chromosomes). Sequence similarity is depicted as ribbons connecting matching regions (red ribbons for TAS-like regions, black ribbons for all 
other regions). The two inner grey circles show the location of annotated genes in black. B) a section of the visualization showing a magnification of 
chromosomes II and III. Blue arrows indicate the presence of only one gene at the end of each TgTAS-like block. 
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uncover such patterns in our TgTASL regions, we per- 
formed detailed dotplots for the pairwise comparisons of 
them using Dotter (see Additional file 2), [35]. These plots 
allowed us to define the limits and organization of each 
TgTASL region. These limits were arbitrarily restricted to 
the shared (consensus) sequence present in all TgTAS-like 
regions. The location of these regions in each chromo- 
some is shown in Table 1. We named each TgTASL based 
on the chromosome number where it is located (for exam- 
ple TgTASLJa is the TgTASL of chromosome la). In cases 
where more than one TgTAS-like was present in a chro- 
mosome, we added a letter suffix ("-a" "-b" etc.) to reflect 
this fact (for example in chromosome III there is one 
TgTASL_III-a near the chromosome end and a second 
TgTASL_III-b towards the centromere, see Figure lA, and 
Table 1). It is important to highlight that there are other 
genomic sequences that were excluded from our analy- 
sis because they are shared among a few chromosomes 
but do not present a conserved pattern (for example 
the sequences between TgTASL_III-a and _III-b which 
are also present at the end of chromosome IX and la). 
Nevertheless, this observation suggests that TgTAS-like 
regions could be larger than we describe here and/or 
that other sequences with similar characteristics can be 
found. 

Dotplots comparing one TgTAS-like with itself revealed 
the presence of internal repeats (Additional file 2B). Three 
blocks containing tandem repeats were identified and 
named "X gondii Telomeric Associated Repeat Element 
(TgTARE)" 1 to 3. These repetitive blocks were separated 



by non-repetitive DNA fragments which we denomi- 
nated as Fragments A to C, respectively (Figure 2A, 
Additional file 3A). Each TgTARE was analyzed with 
Tandem Repeat Finder [36]. In all cases, the tandemly 
repeated unit was similar but not identical in size 
and sequence. Their sizes were ~ 350 bp or ~ 680 
bp for TgTAREl, ~ 250-300 bp for TgTARE2, and 
~ 250-300 bp for TgTARE3 (Additional file 3B). The 
identity of these repeats was analyzed by BLAST. 
Interestingly, TgTAREl belongs to the satellite DNA 
Sat350/Sat680 family already described to be near Tox- 
oplasma telomeres by Bal31 digestion [37], and also 
detected in Neospora caninum [38] . Another Toxoplasma 
element already described to be present at the end of 
several chromosomes is TgIRE [12,39]. This element 
was found to be part of every TgTASL, located con- 
tiguous to the TgTARE3 in Fragment C (Figure 2B, 
Additional file 3A). 

We also inspected annotations and functional genomics 
data available at the ToxoDB resource. Except for one 
gene localized at the end of every Fragment C (see 
Figure 2B, and Additional file 3A), the lack of tran- 
script expression and protein expression data mapped 
to these regions suggest that TgTASL are mostly silent 
and/or gene-free DNA. The single gene located at the 
end of Fragment C is a hypothetical protein annotated 
as a member of the 'Toxoplasma Specific Family' (TSF) 
[5]. Interestingly, the chromosomal location of TSF family 
members is restricted to the TgTASL region. Based on this 
observation, we could identify two extra members of this 



Table 1 Localization of putative TgTASL regions 



ID 


chr 


ME49 


GT1 


VEG 


TgTASL_la 


la 


1-15692(15691) 


1-22843 (22842) 


1-25042 (25042) 


TgTASL_ll 


II 


2217257-2260968(43711) 


2241754-2265002(23248) 


2245054-2290271 (45217) 


TgTASL_lll-a 


llP'2 


2446355-2471000(24645) 


70927-103813(32886) 


9692-53051 (43359) 


TgTASLJII-b 


III 


2291623-2321755(30132) 


2287172-2317147(29975) 


2285863-2316508(30645) 


TgTASL_IV 


IV 


2179605-2207233(27628) 


2212915-2240672(27757) 


2244426-2273211 (28785) 


TgTASL_V 


V 


100060-130829(30769) 


16-29792 (29776) 


432280-461875 (29595) 






1-2528 (2527)^ 






TgTAS-L_VI 


V|3 


1-6928 (6927)^ 


3642075-3650379(8304) 


1713506-1741000(27494) 






1-7727(7726)^ 






TgTASL_IX 


IX 


74193-94753(20560) 


116577-138592(22015) 


72421-100418(27997) 


TgTASL_X-a 


X 


2008-28943 (26935) 


1-26609 (26608) 


40642-71653(31011) 


TgTASL_X-b 


X 


65038-90348 (25310) 


59949-85376 (25427) 


120781-147164(26383) 


TgTASL_XI 


XI 


634-27485 (26851) 


1-29121 (29120) 


1-27453(27453) 


TgTASL_XII 


XII 


24-16153(16129) 


44111-88494(44383) 


55337-111064(39762) 



Chromosome ranges correspond to the start and end coordinates of each TgTAS-like in the corresponding chromosome. 
^TgTASLJII-a in GT1 strain is present in the scaffold scf_11 070009991 96. 
^TgTASLJII-a in VEG strain is present in the scaffold scf_11 04442825704. 

3TgTASL_VI in ME49 strain can be assembled from the contigs: DS984876 (a), DS984831 (b), and DS984825 (c). 
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Figure 2 Internal structure of Telomeric-associated sequences in Toxoplasma. The figure shows a schematic representation of the internal 
organization of7oxop/Gsmatelomeric-associated sequences - lil<e. A. represents the consensus patern TSF = Toxoplosmo Specific Family [5];TgTARE = 
7oxop/6/smc/ Telomere Asociated Repetitive Elements, TgIRE = Toxoplasma Interspersed Repetitive Element [38]. B. Scaled representation of each 
TgTASL. The (+) and (-) next to the TgTASL name, means sense or antisense DNA strand location, respectively. The number of copies of each TgTARE 
is denoted with a subscript. Triangles point the sequences analyzed by ChIP- qPCR. 



family: TGME49_098060 as TSFll, and TGME49_100990 
as TSF12 (Figure 2B, Additional file 3C). All these find- 
ings confirm that TgTAS-like regions have a conserved 
canonical structure. 

All TSF genes are apparently transcribed; they have 
associated EST and RNA-seq data, as well as H3K9ac 
and H3K4me3 peaks that indicate the presence of an 
active promoter. In addition, the transcript levels of these 
genes seem to vary among parasite strains (type I, II, and 
III), stages (tachyzoite, bradyzoite, sporozoite), and/or cell 
cycle (Additional file 3C, expression evidence at ToxoDB 
v7.3). However, only TgTSFS has evidence of expression 
at the protein level (mass spectrometry evidence at Tox- 
oDB v7.3). This behavior is similar to the one described for 
subtelomeric genes. Further studies should be performed 
to elucidate the role of TSF members. 



Comparing TgTASL chromosome coordinates (Table 1) 
and their DNA strand location (Figure 2), it is clearly 
evident that these genomic elements have a defined ori- 
entation. TgTARE 1 is located at the beginning of the 
TgTASL, whereas the TSF is at the end of the element, 
being in the forward strand when it is located at the 
beginning of the chromosome, and in the reverse strand 
when it is located at the end of it. Therefore, TgTARE 1 
will be closer to the telomere and TSF toward the cen- 
tromere. This is true for all TgTASL except for TgTASLJV 
and TgTASL_IX, which are in an inverted orientation: 
they are both at end of the chromsome but in the for- 
ward strand. Interestingly, in the current genome assem- 
bly (ToxoDB v9.0) the TgTASLJX is now at the end of 
chromosome XII in the reverse strand, as observed for 
the rest of the TgTASL sequences (Additional file 3G); 



Dalmasso etal. BMCGenomics 2014, 15:21 
http://www.biomedcentral.eom/1 471 -21 64/1 5/21 



Page 6 of 1 5 



hence, TgTASLJV is the unique TgTAS-like region with 
a different orientation. 

It is expected that subtelomeres start next to the tract 
of telomeric repeats. In Toxoplasma the TgTASL are 
located at the ends of chromosomal assemblies, but their 
proximity to the telomeric repeats is hard to determine 
because the chromosome ends are not completely assem- 
bled. In Toxoplasma the telomeric repeat seems to be 
TTTAGGG. Tracts of this repeat are assembled only at 
the end of chromosomes la, X and XI, and at the begin- 
ning of chromosomes III and XI. Only one TgTASL is 
close to them, TgTASL_XI, which is next to the telom- 
eric repeat (data not shown). In other cases (TgTASLJa, 
_X, and _XII), where telomeric repeats are not assembled, 
the TgTASL are the first or last sequence region in assem- 
bled chromosomes (Table 1, Additional file 3F). Except for 
TgTASLJV, all TgTAS-like are within the 6% final por- 
tion of each chromosome (Additional file 3F). Therefore, 
in general, these regions are either next to or in close 
proximity to telomeres. Recently, more repeat tracts of 
TTTAGGG have been assembled (ToxoDB v9.0). They 
can be observed at both ends of chromosomes lb. III, XI, 
IV y Vila, at the beginning of chromosomes VI and Vllb, 
and at the end of chromosomes la, IX y XII. They appear 
in the forward strand when they are the first chromosomal 
sequence element, and in the reverse strand when they 
are the last one. In this assembly, there is another TgTASL 
close to a telomeric repeat, TgTASL_III-a. There are ~ 50 
Kb between the telomeric repeat and this TgTASL, where 
there are not annotated genes (Genome Browser in Tox- 
oDB v9.0). Other curious observation is that the telomeric 
repeat at the end of chromosome IV is not so close to the 
end, and it is in the forward strand. Interestingly, it is ~ 60 
Kb upstream the TgTASL_IV, which is the unique TgTASL 
at the end of a chromosome in the forward strand. In addi- 
tion, the genomic sequence between the telomeric repeat 
and this TgTASL does not contain any annotated genes, as 
occur with TgTASL_III-a. 

TgTAS-like regions present a special nucleosomal 
composition, enriched in heterochromatin markers 

Several sequences located within TgTAS-like regions were 
previously reported to hold characteristics of silent chro- 
matin. Satellite DNA Sat350, (corresponding to TgTAREl) 
was shown to be enriched in the heterochromatin his- 
tone markers H4K20 mono- and tri-methylated [8,40]. 
In addition, there is a group of small RNAs identified 
in Toxoplasma that perfectly map to Sat350 repeats [40], 
probably contributing to their heterochromatin state. On 
the other hand, we already published that TgIRE, local- 
ized in Fragment C, is enriched in the heterochromatin 
markers H4K20mel and H2A.X [14]. Almost the rest 
of Fragment C corresponds to the TSF gene sequence. 
These genes are rich in H3K4mel whithin the entire gen 



body, as it is observed for most genes in Toxoplasma 
(ToxoDB Genome Browser). Based on the availability of 
these previous data, we decided to analyzed the pres- 
ence of these markers in sequences within Fragment A 
by ChlP-qPCR, together with the euchromatic histone 
markers H3ac, H2A.Z and H2B.Z [7,12,13]. The amplified 
sequences specifically match Fragment A of TgTASLJI, 
_V and _XII (triangles in Figure 2B); where it was possible 
to design primers in order to obtain unique and specific 
products. In all cases, the ChlP-qPCR profile observed 
was similar, being enriched in heterochromatin markers 
(Figure 3). The sagl promoter was used as a control of 
active chromatin [14]. These data increase the evidence 
supporting the idea that TgTAS-like regions are indeed 
silenced domains of the chromosomes, in agreement with 
the data presented above. 

TgTAS-like regions are conserved in three T, gondii 
evolutionary lineages 

Initial studies showed that the majority of T gondii strains 
isolated in Europe and North America belong to three 
distinct clonal haplotypes, called types I, II, and III [41]. 
In this section, the analysis performed with the ME49 
(type III) genome data was extended to the genomes of 
the GTl (type I) and VEG (type II) strains (data avail- 
able in ToxoDB). Using the same BLAST strategy to detect 
similarity among chromosomes, all the TgTAS-like ele- 
ments detected in ME49 were also detected in these two 
lineages (Table 1). In addition, we compared all iden- 
tified TgTASL among strains. In all cases, the TgTASL 
regions were syntenic (Additional file 4), and the pair wise 
identity between each TgTASL and its syntenic partners 
in other strains was >96%. Furthermore, a new TgTASL 
regions was found in chromosomes VI of the VEG and 
GTl strains (Table 1), which was not previously detected 
in ME49 chromosomal assemblies. We took advantage of 
the high identity between TAS-like regions in the three 
lineages, and used TgTASL_VI of GTl and VEG to per- 
form BLAST searches in ME49 contigs that were left 
out of the chromosomal assembly. Sequences correspond- 
ing to a putative TgTASL_VI could be identified in the 
ME49 contigs DS984876, DS984831, DS984825 (Table 1). 
The presence of additional sequences similar to TgTAS- 
like was examined in contigs and scaffolds, in the three 
strains: ME49, GTl and VEG. Several sequences were 
retrieved, including contigs/scaffolds containing a com- 
plete TgTASL (Additional file 3E). These could represent 
assembly artifacts, or bona fide TgTASL regions that could 
not be accommodated properly in the current assembly. 
In total, we detected twelve TgTASL regions distributed in 
10 chromosomes (Table 1). 

Recently, a new Toxoplasma ME49 genome assem- 
bly was made available in ToxoDB (version 9.0 release, 
September 2013). Although most of the TgTASL regions 
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Figure 3 Nucleosome composition of TgTASL. Chromatin immunoprecipitation followed by qPCR (ChlP-qPCR) was carried out using anti-H2A. 
X, -H2A.Z, -H2B.Z, -H3ac, and -H4K20me3 antibodies. A rabbit preimmune antibody was used as negative control (CTR). The amount of TgTASLJI, 
V, and XII associated to each histone was determined by qPCR, and represented as % Input (where the Input is the 1 0% of total parasite lysate used 
per reaction). 



are conserved in the two assemblies, some differences 
could be observed. In the newest version there are two 
extra TgTASL regions, one at the opposite end of chromo- 
some II, and one in chromosome X, upstream of TgTASL_ 
X-a (Additional file 3F); in addition to the relocation 
of TgTASL_IX. Overall, this suggests, together with the 
sequences found in scaffolds, that there may be additional 
TgTAS-like sequences. The current draft nature of the 
T. gondii genome currently prevents us from knowing if 
both chromosome ends are covered with these TAS-like 
sequences. 

Finally, we also investigated the conservation of these 
TgTAS-like regions in other coccidian genomes avail- 
able in ToxoDB {Neospora caninum and Eimeria tenella), 
BLAST searches using TgTASL sequences as query only 
revealed matching sequences with low similarity to these 
genomes. Besides, TgTASL sequences do not present 
detectable similarity to other Apicomplexan genomes 
available in EuPathDB. By using the same intra-species all- 
vs-all chromosomal comparisons used before, we could 
identify two putative N, caninum TAS-like regions in Nc 
chromosomes Vila and Vllb, as well as other related 
DNA fragments (Additional file 3D, Additional file 5). The 



presence of a conserved structure of repetitive elements 
in these NcTASL regions were evaluated by pairwise com- 
parisons using Dotter. Dotplots with representative results 
are shown in Additional file 5. In this analysis it is clear 
that NcTASL also present a conserved pattern, consist- 
ing of two blocks of repeats separated by non- repetitive 
DNA (first two dotplots of the first two rows in Additional 
file 5). However, the similarity between NcTASL and 
TgTASL is extremely low (three last panels of the first 
two rows in Additional file 5). Consequently, TgTAS-like 
regions are composed of sequences that are unique to 
Toxoplasma, 

TgTAS-like show a unique bias in trinucleotide composition 

Because TgTAS-like regions are depleted of genes and 
are localized near chromosome ends, we reasoned that 
they might present a different base composition from the 
rest of the genome. We thus investigated a number of 
ways to identify compositional bias in these sequences. 
For this, we analyzed the nucleotide composition of these 
regions using different metrics, which are all independent 
of linear sequence similarity. The trinucleotide compo- 
sition of TgTAS-like sequences was used in our case to 
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identify diagnostic biases. Our measure of compositional 
bias simultaneously considered all 64 trinucleotides across 
a given sequence using a multivariate statistical technique 
(Correspondence Analysis, see Methods). 

Correspondence Analysis (CA) is a powerful statistical 
technique to explore high-dimensional categorical data. 
The trinucleotide composition of a DNA sequence is 
one such dataset; with each dimension/column measuring 
the count of a different trinucleotide (there are 64 trin- 
ucleotides/dimensions) in different genomic fragments 
(rows). The aim in a correspondence analysis is to identify 
trends in the variation of data and associations among dat- 
apoints. CA has been successfully applied to the study of 
synonymous codon usage bias associated with high levels 
of expression [42-45] . Using this technique, it is possible to 
summarize and explore most of the 64-dimensional infor- 
mation using only a few representative axes. This allows 
visual representations in 2- dimensional plots (also known 
as "CA maps") that capture the most influential trends 
that define the compositional bias. 

We performed a CA of trinucleotide composition of 
the Toxoplasma ME49 genome, using different frag- 
ments lengths (see Methods). This analysis allowed us to 
compare the trinucleotide composition of TgTASL frag- 
ments to that of the genome. As a result of this analysis 
we found that TgTASL regions display a unique trinu- 
cleotide composition. They are highly enriched in TAA, 
TAG, ATT (and their reverse complements TTA, CTA, 
AAT), moderately enriched in other trinucleotides such 
as TAG/CTA, ATG/CAT, TGA/TCA, while depleted of 
AGA/TCT, CTC/GAG and CGC/GCG (Figure 4). In this 
compositional signature, both the enrichment and the 
under-representation of some trinucleotides is important 
to explain TgTASL composition preferences. This com- 
positional bias was not detected at 1 Kb nucleotide frag- 
ments, but it was significant at 10 Kb, and peaked at 40-50 
Kb resolution (Figure 5 left axis, relative bias), which is 
the average size of TgTASL regions. For larger window 
sizes, the composition of different TgTAS-like containing 
fragments begins to diverge due to mixture of different 
genomic contexts within the same window (Figure 5, right 
axis, TgTASL fragments compactness). CA maps for all 
window lengths are shown in Additional file 6. Inter- 
estingly, a similar trinucleotide bias was also observed 
in centromeres (Figure 4, Additional file 6), which are 
also gene-depleted genomic regions [26], but without 
any similarity at sequence level among them and/or with 
TgTAS-like regions. 

In the CA analysis mentioned above (window size = 40 
Kb), the major compositional trend, represented in the 1st 
principal coordinate (horizontal axis of Figure 4), explains 
45% of the information on the variability of trinucleotide 
frequencies. This means that CA is effectively summariz- 
ing high-dimensional data, due to the presence of strong 



trends (biases). The second trend, represented as the 2nd 
principal coordinate (vertical axis in Figure 4), which is 
independent to the 1st one (CA produces orthogonal axes 
by design), is lead by one trinucleotide pair (ATA, TAT). 
Upon further inspection, we found that this trend is due 
to the presence of sequences with long TpA dinucleotide 
repeats, which are not highly represented in TgTAS-like 
regions. The relative contribution of each trinucleotide 
to the Irst and 2nd Principal Coordinates are listed in 
Additional file 3H. 

The unique trinucleotide bias of TgTAS-lilce sequences is 
shared with other gene-poor genomic regions in the T. 
gondii genome 

We have identified many non-TgTAS-like regions that 
show a trinucleotide composition bias that is similar to 
that observed in TgTAS-like regions (Figure 4, previ- 
ous section). A group of such non-TgTASL regions are 
the centromeric sequences, which are also gene-poor 
regions. To further explore the observed relationship 
between trinucleotide bias and coding potential we cal- 
culated a gene density index across the genome, and 
compared this to the trinculeotide bias (IPC in Figure 4). 
The first principal coordinate was negatively correlated 
to gene density across the genome (Pearsons correla- 
tion coef. -0.445, P-value < 10£~^^). Genomic fragments 
with low IPC values are exclusively gene-rich while high 
IPC values are associated to gene-poor or gene-depleted 
regions, such as TgTAS-like regions and centromeres 
(Figure 6). This is also consistent with the observed over- 
representation of stop codons in sequences with high 
IPC values. Genomic regions with similar TgTAS-like 
compositional bias were identified at every chromosome 
end (Additional file 7), which might correspond to other 
subtelomeric sequences. However, other regions with 
a similar compositional bias were also detected across 
all chromosomes, probably denoting gene-poor regions 
and/or heterochromatin, which should be confirmed 
experimentally. 

Discussion 

Telomeric and subtelomeric heterochromatin differs from 
standard heterochromatin on several aspects including its 
sequence, nucleosome composition, and nature of bind- 
ing factors. Even though these three components vary 
between species, telomeric function is conserved [17]. 
Here we have identified and characterized a number of 
Toxoplasma TAS-like regions, which are localized near 
several chromosome ends, and are specific for T gondii. 
They have features similar to those observed in other 
subtelomeric sequences like blocks of repeats, stress- 
asociated genes, and a heterochromatin-like structure, 
based on the presence of informative histones and histone 
post-translational modifications. 
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Figure 4 Symmetric map (biplot) of trinucleotides and genomic fragments from a Factorial Correspondence Analysis (CA). Correspondence 
analysis was performed as described in Methods. The 1st and 2nd coordinates plotted in the map represent 65% of the total inertia. Data points 
that are close to the origin (0,0) have no compositional bias (homogeneous usage) while points that move away from the origin represent larger 
deviations. In the plot, data points corresponding to genomic fragments (N = 40 Kb) are shown in green, and are overlayed with data points 
corresponding to centromeric sequences in blue), data points corresponding to TgTAS-like sequences (in black), along with the 64 trinucleotides 
(in magenta). Opacity of colours is related to how well each point (its inertia) is represented in this 2 dimensional subspace. Plot interpretation: 
fragments (dots) that are close in the plot have similar trinucleotide composition, and trinucleotides (triangles) which are close to each other 
frequently appear in the same fragments. Distances between dots and triangles have no meaning, however, the directions of dots and triangles from 
the origin (0,0) are meaningful (e.g. fragments on the right of the origin use moreTAA, CTA, A^(and their reverse complements ^A, TAG, AAT) and 
less AGA, CTC and CGC (and reverse complements TCT, GAG and GCG), whereas the opposite is true for the fragments lying on the left of the origin). 



Even though the T. gondii genome was recently fur- 
ther sequenced and re-assembled, the subtelomeres of this 
important parasite were not identified or characterized. 
Telomeres and subtelomeric sequences are noticeably 
hard to assemble when using shotgun sequencing. This 
is because the telomeric repeat is identical in all chro- 
mosomes and therefore cannot be used to distinguish 
one telomere from another. Furthermore, they are not 
expected to be well represented in the Bacterial Artificial 
Chromosome (BAG) libraries used to guide the scaffold- 
ing and chromosomal assembly of draft genomes, due to 
the size-selection of recombinant DNA clones and the 
natural genomic instability of these regions. Finally, they 
provide large similarity segments that are rich in repeti- 
tive sequences, which are the cause of frequent assembly 
errors. Using the current chromosomal assemblies, we 
mapped TgTAS-like sequences to a conserved region of 
30 Kb located at the end of most chromosomes. We 
defined the extent of these regions based on the consensus 



TgTAS-like structure present in most chromosomes. 
However, it is reasonable to consider that the actual size 
of each TgTASL may be larger. In this regard, the fact 
that the trinucleotide compositional bias peaks at 40-50Kb 
provides an independent estimate for the extent of these 
regions. 

In all cases, the identified TgTASL regions displayed a 
similar structural pattern, consisting of three telomeric 
associated repeats (TgTARE 1-3), with a single member 
of the Toxoplasma Specific Family, separated by silenced 
DNA. These TgTASL regions have a clear chromosomal 
orientation, with the TgTARE 1 towards the telomere, and 
the TSF gene towards the centromere; similar to P. falci- 
parum TASs [28]. Nevertheless, several exceptions were 
detected in the ME49 genome that still remain in the new 
genome assembly (ToxoDB 9.0). In chromosomes III and 
X there are two consecutive TgTASL regions, and in chro- 
mosome IV, the TgTASL_IV is inverted (its TgTARE 1 is 
towards the centromere and the TSF gene towards the 
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Figure 5 Strength of the compositional bias signal of TgTASL sequences over different lengths of genomic fragments. The curve in blacl< 
shows the bias ratio of TgTASL-containing genomic fragments relative to the compositional bias of all genomic fragments at a given window 
length. For example, at 1 Kbp fragments, TgTASL containing fragments show a relative bias of 1 , ie. there is no TgTASL specific bias, while there is a 
peak at 50Kbp, with a relative bias of ~4.5 (ie, TgTASL have on average ~4.5 times more trinucleotide bias compared to all the genomic fragments). 
The curve in red shows the divergence in trinucleotide composition of TgTAS-like containing fragments. At 40Kbp the divergence presents a 
minimum, ie the TgTASL fragments are highly compact. 



telomere). These unusual features, which so far are unique 
to Toxoplasma genomes, as well as the presence of TAS- 
like regions at only one end of the chromosomes, could be 
explained by simple chromosomal rearrangements, prob- 
ably with functional consequences. However, they could 
also be explained by artifacts of the genome sequenc- 
ing and/or assembly. Several observations support this 
last possibility: changes in the chromosome location and 
number observed between TgTAS-like regions in two 
different genome assemblies (Additional file 3F); the relo- 
cation of TgTASL_IX which is now located at the end 
of chromosome XII with the expected orientation; and 
the absence of BAC-end sequences connecting TgTASL 
regions with the rest of the assembled chromosome, as 
in the case of TgTASL_IV (ToxoDB Genome Browser; 
see also Figure 4 in Khan et al, publication [4]). There- 
fore, further studies should be performed to shed light 
on these issues. Also future genome assemblies should 
pay special attention to the quality of the assemblies near 
chromosome ends. 

From human to yeast, the shelterin protein com- 
plex binds DNA telomeric repeats, and subtelomeric 
sequences are silenced as evidenced by the presence 
of telltale heterochromatin epigenetic marks such as 



histones H3K9me3 and H4K20me3, methylated DNA, 
and heterochromatin protein 1 (HPl) [46,47]. Telomeric 
and subtelomeric sequences are transcribed in a telom- 
eric repeat-containing RNA (TERRA), which regulates the 
telomere function [47]. Some telomeric binding factors 
have been described in R falciparum including PfSir2a, 
a sirtuin that may be important to maintain DNA as 
heterochromatin [31,48]; PfOrcl that binds to telomeres 
and TARE-3, and may be involved in forming the T-loop 
structure that prevents fusion between chromosome ends 
[48]; and PfSIP2 (ApiAP2 family member) that binds to 
TARE2, TARE3 and the upsB promoter of a var gene, 
and colocalizes with PfHPl that binds H3K9me3, sug- 
gesting that both proteins participate in the assembly of 
telomeric heterochromatin [49]. With the exception of 
TgChromol [10], none of these proteins have been char- 
acterized to date in Toxoplasma^ despite the fact that all of 
them have orthologs. TgChromol is the HPl chromobox 
homologue. A number of studies focused on this pro- 
tein provide additional evidence that TgTAS-like regions 
are embedded in a typical heterochromatin environment. 
Although TgChromol is mainly associated with pericen- 
tromeric heterochromatin, IFA-FISH experiments have 
shown that TgChromol is also in close proximity to 
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Figure 6 Trinucleotide bias vs gene density. Scatterplot of 
trinucleotide compositional bias (Correspondence analysis 1 rst 
Principal Coordinate) at 40Kb vs. gene coverage (ie, proportion of 
nucleotides in a genomic fragment, being part of a gene) for all 40K 
genomic fragments, overlapped by 39K. TgTAS-like associated 
fragments are painted black. A strong non-linear relation can be 
appreciated between the variables, where low gene density regions 
(including gene-depleted regions) present a particular trinucleotide 
composition (high 1 PC). Also, there is a negative linear component 
relating both variables (Pearson's correlation coefficient -0.445, 
p-value < 10-^^). 



repeats present at the end of chromosome la and IX at the 
nuclear periphery [10]. Interestingly, the chromosome la 
probe used by Gissot et al. [10] corresponds to TgTAREl, 
suggesting that TgTASL elements could be located at the 
nuclear periphery. In a separate set of ChIP- qPCR experi- 
ments, Braun etal, detected an enrichment of H4K20mel 
and H3K9mel heterochromatin marks associated with 
the Sat350 repetitive element (which would be TgTAREl) 
[40]. We also revealed the enrichment of H4K20mel and 
H2A.X (another silencing-associated histone in T, gondii 
[14]) in fragment A from three TgTASL, and also in TgIRE 
[14]. In addition, Braun et al, uncovered the presence of 
repeat-associated siRNAs in Toxoplasma, which map to 
Sat350 (TgTAREl) and Sat529a [40]. It remains to be seen 
if these satellite-associated RNAs are involved in hete- 
rochromatin formation and/or in the regulation of longer 
telomeric noncoding RNA resembling the TERRA RNA 
from other organisms [50]. 

Genes that are close to subtelomeric regions have been 
implicated in a wide array of stress responses or niche 
adaptive roles [17]. This is also true for a number of impor- 
tant human pathogens, including T, brucei and T, cruzi, 
L, major, and P, falciparum, where so called contingency 
gene families are located embedded within or just next 
to the subtelomeric repeats. Toxoplasma contains sev- 
eral distinct, coccidian-specific multicopy gene families 



throughout its genome, including those that encode the 
SRS, ROPK, and SUSA proteins [51-53]. Recently, a bioin- 
formatics study showed that 60 out of the 144 ME49 SRS 
genes (42%) are located in subtelomeric sites [54]. How- 
ever, none of these genes are near TgTAS-like regions in 
the current genome assembly. We only detected one Tox- 
oplasma specific family (TSF) member [5]. These TSF 
members were also found in VEG and GTl T, gondii 
strains. Judging by the sequence, length, and number of 
TM domains, this family presents a high diversity among 
their members (Additional file 7C). Moreover, such diver- 
sity extends to their expression profiles, as the appar- 
ent expression of each TSF gene varies among different 
strains, life cycle stages, and/or throughout the cell cycle 
(Additional file 3C). Notwithstanding this, there is mass 
spectrometry evidence for only one TSF member, accord- 
ing with the experimental data available in ToxoDB ver- 
sion 7.3. Although the function of TSF genes is unknown, 
their expression profiles resemble those of tightly reg- 
ulated genes associated to parasite adaptation to the 
environment (differences between species), or virulence 
(differences between strain). Hence, they could be pro- 
posed as Toxoplasma stress and/or contingency response 
genes. Notably, recent RNAseq experiments revealed 
some discrepancies between transcripts and annotat- 
ed genes, and supported the existence of new genes 
and/or putative non-coding RNAs ([55], and RNAseq 
evidences in ToxoDB v9.0). Some of these RNAseq 
transcripts can be detected within several TgTASL, sug- 
gesting there might be more genes present in these 
regions. 

The trinucleotide composition of TgTASL DNA dis- 
plays a strong bias when analyzing genomic fragments 
greater than ~5 Kb. However this bias it not evident 
below this size, indicating that this is a long-range bias. 
The major compositional trend is associated with the rel- 
atively high frequency of stop codons in these regions, in 
correlation with the absence of genes. However, the trinu- 
cleotides TAA, TAG and TGA (together with their reverse 
complements) which are read as Stop codons within cod- 
ing sequences, only contribute to 17.4% of this major 
trend. Other trinucleotides, not particularly associated 
with non-coding sequences, such as the pairs CTC/GAG, 
AGA/TCT, CGC/GCG, AAT/ATT, GTA/TAC have larger 
contributions to this major trend (in Additional file 3H). 
For example, the trinucleotide pairs AAT/ATT (enriched) 
and AGA/TCT (depleted) together represent 19.6% of 
the bias in the major trend, and show a clustering pat- 
tern of TgTAS-like and centromeric sequences similar 
to the one observed for the stop codons (Additional 
file 8). 

Interestingly, the second most relevant trend at ~40 
Kb is led by a single trinucleotide pair ATA/TAT. These 
trinucleotides also represent the major trend (IPC) at 
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shorter fragments (1Kb). Although, there are other AT- 
rich trinucleotides such as ATA/ TAT and TTA/TAA, they 
define two major independent trends, respectively. The 
trinucleotides ATA/TAT (but not TTA/TAA), which also 
govern the compositional bias at shorter window sizes, are 
part of microsatellite-like sequences (TpA dinucleotide 
repeats). These TpA tracts have the lowest base stacking 
energy, and therefore the greatest flexibility for unwind- 
ing the DNA (like in the TATA box); consecuently, they 
provide interesting functional features to these genomic 
territories in T. gondii [56]. 

All observed trinucleotide frequencies were approx- 
imately identical to the frequency of their reverse- 
complementary trinucleotides in the X gondii genome 
(Figure 4). These correlations between a trinucleotide and 
its reverse complement were evident even at windows 
of size 5 Kb, and strongly increase for larger fragments. 
We constructed these CA maps with trinucleotide counts 
obtained from a single-strand of each chromosome to 
show an important consequence of this fact: TgTAS-like 
regions located in the positive strand cluster together with 
negative-strand TAS (i.e. their trincucleotide composi- 
tions are highly similar), even though their sequences are 
completely different (they are reverse-complements). This 
forward/reverse- complement symmetry of trinucleotide 
frequencies is part of a more general property: Chargaff s 
second parity rule. This rule states that within a sin- 
gle strand of double-stranded DNA, each oligonucleotide 
occurs with approximately the same frequency as its 
reverse-complement [57]. This observation has been ver- 
ified for "sufficiently long" (>100 Kb) genomic sequences, 
in a wide range of sequenced genomes of bacteria and 
eukaryotes [58]. Here we confirm that this is also true 
within the TgTAS- like regions of ~40 Kb in T, gondii. 

Finally, it is also noteworthy that the major trend (IPC) 
in trinucleotide composition allows the separation of 
TgTAS-like regions from the rest of the genome. This 
essentially means that TgTASL represent some of the 
most peculiar regions of the genome when looked at the 
level of trinucleotide composition. Other genomic regions 
that share the same compositional bias are centromeric 
fragments, most of the chromosome ends in the current 
assembly, as well as other chromosomal internal regions 
(see Additional file 7). We propose that this composi- 
tional bias may be associated with a more general prop- 
erty of these regions, such as their preferred chromatin 
state. However, further studies will need to be performed 
to elucidate the functional features shared by all these 
regions in T, gondii. 

Conclusions 

In this work we have identified and described a num- 
ber of chromosomal regions, that present a special long- 
range compositional bias, are gene-poor, and enriched 



in repeats and in heterochromatin-associated histones. 
These genomic domains are usually present in subtelom- 
eric and subtelomeric-like regions. In addition, this long 
range compositional bias was also detected in other Tox- 
oplasma gene-poor genomic regions that do not share 
any sequence similarity among them. Future studies are 
planned to further study the protein(s) associated to these 
TgTAS-like regions as well as the role of these elements in 
Toxoplasma biology. 

Methods 

Sequence resource, analysis and representation 

The genome sequence, coding sequences, and gene 
annotations were downloaded from ToxoDB version 7.3 
(www.toxodb.org). Some results were compared with 
the current version of ToxoDB v9.0. R falciparum 3D7 
genome assembly was download from PlasmoDB v8.2 
(www.plasmodb.org). Comparative genomics were per- 
formed using BLAST version 2.2.25 obtained from the 
NCBI [59], and Web ACT [60]. BLAST results were ana- 
lyzed and visualized with the Artemis Comparative Tool 
(ACT, http://www.sanger.ac.uk/Software/ACT) [61]. Cir- 
cular chromosome layouts were made with circos (http:// 
circos.ca/) [62]. The size and repeats of TgTASL regions 
were determined by visual inspection of chromosomal 
dotplots generated with Dotter (http://sonnhammer.sbc. 
su.se/Dotter.html) [35]. To facilitate the pairwise compar- 
ison of all chromosomes, we produced a dotplot compar- 
ing a multifasta file containing all chromosome fragments 
encompassing TgTASL, whose size was estimated using 
ACT. 

ChlP-qPCR experiments 

Chip and qPCR were performed as described in Dalmasso 
et al [14]. Briefly, 1 x 10^ tachyzoites were used per 
reaction. The 10% of total lysate was used as input. 
The antibodies used to perform the immunoprecipita- 
tion were: a-H2A.X, a-H2A.Z, and a-H2Bv polyclonal 
antibodies previously generated in the laboratory [12,14]; 
a-H3ac (Upstate 17-615), and a-H4K20mel (Abeam 
ab9051). The specific primers used to amplify the TgTAS- 
like are listed in Additional file 3G. They were designed 
to specific amplify each TgTASL, after sequence align- 
ment among them. Sagl promoter was used to represent 
a transcriptionally active chromatin region [14]. Three 
independent experiments were performed. 

Trinucleotide Correspondence Analysis (CA) 

T, gondii ME49 chromosomes (ToxoDB version 7.3) were 
scanned through overlapping windows of varying length 
L, by 1 Kb nucleotide steps (where L ranged from 1 Kb 
to 80 Kb). A number of contigency tables of size 64 x L 
were built by counting the number of each trinculeotide in 
every genomic fragment of length L. CA is an exploratory 
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multivariate statistical technique used to decompose the 
contingency table variability and extract coordinates that 
explain major trends in the inertia of the table [63]. Inertia 
is defined as the total Pearsons Chi- Square of the 2-way 
table divided by the total sum and can be interpreted as 
a global measure of the data's deviation from expected 
values under the hypothesis of homogeneity (where all 
genomic fragments would have similar trinucleotide pro- 
files). Correspondence Analysis was performed using the 
R package CA ([63]). CA plots shown in Additional file 6 
are symmetric maps displaying the 2 principal dimen- 
sions with most of the trinucleotide variability (rows and 
columns in principal coordinates). To determine trinu- 
cleotide bias associated with TgTAS-like regions, we con- 
sidered genome fragments containing complete TgTASL 
regions as well as those fragments where the annotated 
TgTASL region would cover >80% of the window length. 
Average deviation of these fragments {i.e. TgTAS-like trin- 
ucleotide bias, or inertia in the CA theory) from the origin 
(position representing the expected composition under 
homogeneity of trinucleotide frequencies) was calculated, 
and then divided by the average deviation of all genomic 
fragments to derive a measure of TgTASL-specific bias. 
This was repeated for all window lengths (Figure 5, left 
axis). The compactness of TgTAS- like fragments was 
evaluated by measuring disimilarity in trinucleotide space, 
calculated as the mean of all TgTAS-like inter- fragments 
distances relative to the mean inter fragments distances 
considering all genomic fragments (Figure 5, right axis). 
To explore the relation between trinucleotide bias (at 40 
Kb window length) and gene density, we calculated a gene 
coverage index defined as the proportion of bases covered 
by gene annotations (CDS, gene annotation from ToxoDB 
01/2012). The gene coverage index was plotted against a 
40 Kb trinucleotide bias index, which is simply the first 
principal coordinate of the CA. 

Additional files 



Additional file 1 : Pairwise comparisons between chromsomes. ACT 

visualization of chromosome similarities. The represerntation only shows 
the similarities between each chromosome (double grey lines, with 
coordinates in bp) and the contiguous chromosome, above or below [61]. 
Links in red genomic are genomic segments on forward strand, and in blue 
genomic segment on reverse strand. A. All P. folciparum chromosomes, not 
being filtered. B. T. gondii chromosomes containing TgTASL sequences 
(green box with a dashed line). The asterisks point to short TgTASL 
sequences present in other chromosomal regions. Sequences were filtered 
to be at least 180 bp long. 

Additional file 2: All-vs-all dotplot comparisons of TgTAS-like 
regions. Dotplots were generated from a multifasta file containing all 
TgTAS- like regions. The green lines separate each TgTASL. A. The 
complete dotplot of all TgTAS-like regions in the ME49 strain. The axes on 
the top and left show the size in Kb, whereas the ones on the bottom and 
the right shows the name of each TgTASL being compared. B. Schematic 
representation of the dotplot analysis, using part of a row of TgTASLJV. In 
the first panel the TgTASL_IV region was compared with itself, clearly 



showing the three blocks of repeats (TgTARE 1 to 3) being represented by 
several lines (one per repeat). In the rest of the panels TgTASL_IV was 
compared with TgTASL_V to _XII showing similarities among these TgTASL 
regions. Missing sequences, insertions and deletions are visible as gaps in 
the sequence comparison, for example in the last panel, where a run of Ns 
connecting two contigs in TgTASL_XII appears as a large gap in the dotplot. 

Additional file 3: Supplementary tables. This file contains a number of 
supplementary tables referenced in the text, or that provide additional 
support/material for the paper. Each table is contained within a separate 
spreadsheet in the file: AdditionalFileS.xIsx. A: Chromosomal locations of 
TgTAS and flanking TSF genes; B: Characteristics of T. Tel o me re- 
associated repetitive elements - like (TgTAREL); C: List of TSF genes 
retrieved from ToxoDB v7.3, and their relative expression; D: List of putative 
Neospora caninum TAS -like (NcTAS); E: List of contigs and scaffolds with 
similarity to TgTASL; F: Comparison of TgTASL between ToxoDB versions 
7.3 and 9.0 G: List of oligonucleotide primers used to amplify TgTASL 
regions. H: Table of compositional bias contribution per trinucleotide. 

Additional file 4: Comparative genomics of different T. gondii strains. 

Chromosomal similarity visualizations using ACT. BLASTN similarities across 
chromosomes are shown as red (forward strand), or blue (reverse) 
segments. Similarity between TgTAS regions is shown with yellow 
segments. The figure shows a comparative genomics analysis of all 
chromosomes containing TgTASL in three T. gondii strains: GTl (type I), 
ME49 (type II) and VEG (type III). The asterisk next to the ME49 
chromosome VI is indicating that at the end of the chromosome there are 
three additional contigs (light blue) containing sequences similar to 
TgTASL_VI in the GTl and VEG strains. 

Additional file 5: Dotplot of Neospora caninum vs T. gondii TAS-like 
sequences. The presence of conserved patterns in the NcTASL, and their 
lack of similarity against TgTASL, were evaluated by all-vs-all pairwise 
comparison using Dotter. The dotplot includes the 2 putative NcTASL and 
3 representative TgTASL. 

Additional file 6: Correspondence analysis Maps of genomic 
fragments of 1 to 80 Kb. This supplementary file contains symmetric 
biplots similar to those in Figure 4. The PDF file contains a succession of 
maps obtained for increasing genomic window sizes. 

Additional file 7: Visualization of TgTAS-like compositional bias 
along Toxoplasma chromosomes. A schematic representation of 
chromosomes is depicted where the major trend in trinucleotide 
compositional bias (first principal coordinate at N = 40 Kb) is encoded with 
a color gradient going from cyan (negative values in 1 st coordinate, see 
Figure 4) to magenta (positive values in 1 st coordinate), and passing 
through white (zero, no bias). The position of TgTAS-like and centromeric 
regions are marked with black and blue boxes, respectively. 

Additional file 8: Example scatterplots of trinucleotide abundance. 

These plots show the distribution of all 40 Kb genomic fragments according 
to trinucleotide counts of selected trinucleotides. A. This panel shows a 
symetric biplot for two pairs of trinucleotides: TAA^A and TAG/CTA, 
which are read as STOP codons in coding sequences. These contribute 
with a 12.7% of the major bias trend when considering all trinucleotides 
(see text). B. This panel shows a symetric biplot for two other influent 
trinucleotides AWAATand TCT/AGA, contributing with a 19% of the major 
trend. The two axes show the number of trinucleotide counts in a window 
of 40 Kb. In both plots the fragments containing TgTAS-like regions are 
displayed in black and those containing centromeric fragments in blue. 
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